A whole lotta nothing: Comparing statistical approaches to supporting the null

Dr James Bartlett

Overview

  • Approaches to statistical inference and wanting to test the null

  • Target data set: A Comparison of Students’ Statistical Reasoning After Being Taught With R Programming Versus Hand Calculations (Ditta & Woodward, 2022)

  • How do our inferences change depending on the approach?

    1. Equivalence testing

    2. Bayes factors

    3. Bayesian ROPE

Suitability of a point-null hypothesis

  • NHST: Is the point-null plausible? Do you want to make decisions about how to act with a given error rate? (Lakens, 2021)

  • Meehl’s paradox: With increasing sample size, it’s easier to confirm a hypothesis by rejecting a point-null (Kruschke & Liddell, 2018)

  • Crud factor: In non-randomised studies, we might expect non-null effects (Orben & Lakens, 2020), but would they be meaningful?

Supporting the null

  • There are scenarios when supporting the null is a desirable inference:
    • Is there no meaningful difference between two competing interventions?
    • Does your theory rule out specific effects?
    • Is your correlation too small to be meaningful?
  • However, researchers often mistakenly conclude there is a null effect from a non-significant p-value alone (Aczel et al., 2018; Edelsbrunner & Thurn, 2020)

Approaches to probability and inference

Simplest distinction between the approaches (VanderPlas, 2015):

  • Frequentist: Objective theory - Model parameters are fixed and are not subject to a probability distribution

  • Bayesian: Subjective theory - Model parameters are uncertain and subject to probability distributions

Today’s example

  • Technology or Tradition? A Comparison of Students’ Statistical Reasoning After Being Taught With R Programming Versus Hand Calculations (Ditta & Woodward, 2022)

  • Compared conceptual understanding of statistics at the end of a 10-week intro course

  • Students completed one of two versions:

    1. Formula-based approach to statistical tests (n = 57)

    2. R code approach to statistical tests (n = 60)

  • Research question (RQ): Does learning through hand calculations or R code lead to greater conceptual understanding of statistics?

  • Between-subjects IV: Formula-based or R code approach course

  • DV: Final exam (conceptual understanding questions) score as proportion correct (%)

  • Keep in mind the distinction between the RQ/design and the inferences we can make

What are we working with?

Their main results

  • Their first approach to the analysis was a simple independent samples t-test:

    Welch Two Sample t-test

data:  e3total by condition
t = -1.117, df = 110.97, p-value = 0.2664
alternative hypothesis: true difference in means between group HC and group R is not equal to 0
95 percent confidence interval:
 -7.584355  2.116173
sample estimates:
mean in group HC  mean in group R 
        69.29091         72.02500 
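As a sanity check, the reported test statistic and confidence interval can be reconstructed in base R from the summary statistics alone (the standard error of the difference, 2.447684, is taken from the TOST table reported later in these slides):

```r
# Reconstruct the Welch t-test from the reported summary statistics
diff <- 69.29091 - 72.02500   # HC minus R, as in the output above
se   <- 2.447684              # SE of the difference (from the TOST table)
df   <- 110.97                # Welch degrees of freedom

t_stat <- diff / se
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
ci     <- diff + c(-1, 1) * qt(0.975, df) * se

round(t_stat, 3)  # -1.117
round(p_val, 4)   # 0.2664
round(ci, 2)      # -7.58  2.12
```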

What now?

  • We can’t reject the null using a traditional t-test, but how can we test whether there was no meaningful difference?

Keeping frequentist

  1. Equivalence testing

Going Bayesian

  1. Bayes factors (the authors report these)

  2. Bayesian Region of Practical Equivalence (ROPE)

1. Equivalence testing

  • Flips the NHST logic and uses one-sided t-tests (90% confidence intervals) to test your effect against two boundaries:

Figure from Lakens (2017)

TOSTER R package

  • Flexible package (Lakens & Caldwell) that can apply equivalence testing to focal tests such as t-tests, correlations, and meta-analyses

Key decisions to make

  • What alpha value to use?

  • What values to use for the smallest effect size of interest boundaries?

  • Using bounds of ±10%, we can conclude the effect is statistically equivalent and not significantly different to 0:

Welch Modified Two-Sample t-Test
Hypothesis Tested: Equivalence
Equivalence Bounds (raw):-10.000 & 10.000
Alpha Level:0.05
The equivalence test was significant, t(110.97) = 2.968, p = 1.83e-03
The null hypothesis test was non-significant, t(110.97) = -1.117, p = 2.66e-01
NHST: don't reject null significance hypothesis that the effect is equal to zero 
TOST: reject null equivalence hypothesis

TOST Results 
                   t       SE     df      p.value
t-test     -1.117011 2.447684 110.97 2.664022e-01
TOST Lower  2.968483 2.447684 110.97 1.833635e-03
TOST Upper -5.202506 2.447684 110.97 4.542552e-07

Effect Sizes 
                estimate       SE   lower.ci  upper.ci conf.level
Raw           -2.7340909 2.447684 -6.7940673 1.3258855        0.9
Hedges' g(av) -0.2073357 0.188892 -0.5208061 0.1001411        0.9

Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").
  • We can also get a fancy plot showing the equivalence test for both raw and standardised units:
  • However, if we use bounds of ±5%, the difference is not equivalent and not significantly different to 0
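The two one-sided tests behind this output can be reproduced by hand from the mean difference and its standard error, as reported above (a base-R sketch):

```r
# Two one-sided tests (TOST) against raw bounds of ±10,
# using the mean difference, SE, and df from the output above
diff   <- -2.7340909
se     <- 2.447684
df     <- 110.97
bounds <- c(-10, 10)

t_lower <- (diff - bounds[1]) / se   # test against the lower bound
t_upper <- (diff - bounds[2]) / se   # test against the upper bound
p_lower <- pt(t_lower, df, lower.tail = FALSE)
p_upper <- pt(t_upper, df, lower.tail = TRUE)

# Equivalence is declared when the larger of the two p-values < alpha
p_tost <- max(p_lower, p_upper)
round(c(t_lower, t_upper), 3)   # 2.968 -5.203
signif(p_tost, 3)               # 0.00183
```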

What are our inferences so far?

  • t-test: Not significantly different to 0

  • Equivalence test: Statistically equivalent using bounds of ±10%, but not ±5%

  • Bayes factor: TBD

  • Bayesian ROPE: TBD

2. Bayes factors

  • Relative predictive performance of two competing hypotheses (Van Doorn et al., 2021)

  • How much should we shift our prior belief between two competing hypotheses after observing data?

  • Typically comparing a null model vs an alternative model

    • BF10 = 4.57 would mean the data are 4.57 times more likely under the alternative model than the null model

    • BF01 = 2.34 would mean the data are 2.34 times more likely under the null model than the alternative model

BayesFactor R package

  • Package (Morey & Rouder, 2021) that can apply Bayes factors to t-tests, ANOVA, and regression models etc.
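In BayesFactor, the alternative hypothesis is given a Cauchy prior on effect size whose scale is set via rscale; the named presets correspond to the values below. A sketch (the ttestBF call is shown as it would be run on the raw data, which are not reproduced here):

```r
# Named rscale presets for the Cauchy prior on effect size
# (BayesFactor package defaults)
rscales <- c(medium = sqrt(2) / 2, wide = 1, ultrawide = sqrt(2))
round(rscales, 3)   # 0.707 1.000 1.414

# With the raw data available, the analysis would look like:
# library(BayesFactor)
# bf <- ttestBF(formula = e3total ~ condition, data = Ditta_data,
#               rscale = "medium")

# Bayes factors are directional: BF01 is the reciprocal of BF10
bf01 <- 2.872235        # evidence for the null, as reported on the next slide
bf10 <- 1 / bf01
round(bf10, 3)          # 0.348
```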

Key decisions to make

  • What is your prior for the alternative hypothesis?

  • What level of evidence would be convincing?

  • Using the default prior, we have weak evidence in favour of the null hypothesis:
Bayes factor analysis
--------------
[1] Null, mu1-mu2=0 : 2.872235 ±0.02%

Against denominator:
  Alternative, r = 0.707106781186548, mu =/= 0 
---
Bayes factor type: BFindepSample, JZS
  • Strength of evidence guidelines (Van Doorn et al., 2021)

    • BF > 1 = Weak evidence

    • BF > 3 = Moderate evidence

    • BF > 10 = Strong evidence

  • We get somewhat consistent conclusions of weak to moderate evidence in favour of the null:

Prior      Bayes factor
Medium     2.87
Wide       3.84
Ultrawide  5.25

What are our inferences so far?

  • t-test: Not significantly different to 0

  • Equivalence test: Statistically equivalent using bounds of ±10%, but not ±5%

  • Bayes factor: Weak to moderate evidence in favour of the null hypothesis compared to the alternative

  • Bayesian ROPE: TBD

Bayesian modelling

  • Applies Bayesian inference to regression models (Heino et al., 2018):

    • Define a descriptive model of parameters

    • Specify prior probability distributions for model parameters

    • Update prior to posterior distributions using Bayesian inference

    • Interpret model and parameter posterior distributions

3. Bayesian ROPE

  • Compares the mass (typically a 95% highest density interval / HDI) of the parameter’s posterior distribution to a rejection region

  • Similar to equivalence testing, it produces three possible decisions: 1) HDI outside ROPE, 2) HDI within ROPE, 3) HDI and ROPE partially overlap

Figure from Masharipov et al. (2021)
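Those three decisions can be written as a small helper function (a sketch; the interval used for illustration is the 95% HDI for the condition coefficient reported later in these slides):

```r
# Three-way ROPE decision from an HDI and ROPE bounds
rope_decision <- function(hdi, rope) {
  if (hdi[1] >= rope[1] && hdi[2] <= rope[2]) {
    "accept: HDI entirely within ROPE"
  } else if (hdi[2] < rope[1] || hdi[1] > rope[2]) {
    "reject: HDI entirely outside ROPE"
  } else {
    "undecided: HDI and ROPE partially overlap"
  }
}

# 95% HDI for the condition coefficient from the Bayesian regression model
hdi <- c(-2.16, 7.32)
rope_decision(hdi, rope = c(-10, 10))  # accept
rope_decision(hdi, rope = c(-5, 5))    # undecided
```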

brms and bayestestR packages

  • Fit the Bayesian regression model with brms, then summarise the posterior (including the ROPE decision) with bayestestR

Key decisions to make

  • Prior for each parameter

  • Boundaries for ROPE

  • We can get a summary of our intercept and coefficient from the Bayesian regression model

  • The 95% HDI for the coefficient (mean difference) is entirely within ROPE bounds of ±10%:

Parameter  Median  Lower 95% HDI  Higher 95% HDI  ROPE %
Intercept   69.46          65.97           72.91       0
Condition    2.68          -2.16            7.32       1
  • We can even demonstrate it via a fancy plot of the ROPE and posterior distribution for the coefficient
  • Like equivalence testing though, we’re undecided based on smaller ROPE bounds of ±5%, so we would need more data:
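The ROPE percentage itself is just the share of the posterior (here, of the 95% HDI) falling inside the bounds; with a fitted brms model it comes from bayestestR::rope(). A base-R sketch on simulated draws (the normal approximation to the coefficient posterior is an assumption for illustration only):

```r
# Share of posterior draws inside the ROPE, sketched with simulated draws
# (normal approximation to the condition coefficient's posterior;
# mean and spread roughly matched to the reported 95% HDI)
set.seed(1908)
draws <- rnorm(4000, mean = 2.68, sd = 2.4)

mean(draws >= -10 & draws <= 10)   # close to 1: almost all mass within ±10
mean(draws >= -5 & draws <= 5)     # clearly below 1: undecided at ±5

# With the fitted brms model, the same quantity comes from:
# bayestestR::rope(Ditta_fit2, range = c(-10, 10))
```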

What are our inferences so far?

  • t-test: Not significantly different to 0

  • Equivalence test: Statistically equivalent using bounds of ±10%, but not ±5%

  • Bayes factor: Weak to moderate evidence in favour of the null hypothesis compared to the alternative

  • Bayesian ROPE: The coefficient’s 95% HDI falls entirely within ROPE bounds of ±10%, but not ±5%

Summary

  • RQ: There was no meaningful difference between a formula-based and an R code-based course, though questions remain about what we can learn from the design

  • Across frequentist and Bayesian approaches, we get pretty similar conclusions, but decisions in data analysis did affect the conclusions:

    • What boundaries do you use for the smallest effect size of interest?

    • What prior would you use for the alternative hypothesis when calculating Bayes factors?

Where to go next

Discussion

Thank you for listening!

Any questions?

  • What is your preferred approach to statistical inference?

  • How would you set smallest effect size of interest boundaries?

  • What approaches have you used to argue there was no meaningful effect?

Technical details

Bayes factor priors

Priors for Bayesian modelling

Informed priors

# Refit the model, this time specifying our informed priors
library(brms)

Ditta_model <- bf(e3total ~ condition)

Ditta_priors <- c(prior(normal(50, 16), class = Intercept),
                  prior(normal(0, 3), class = b))

# Fit with the informed priors instead of the default flat priors
Ditta_fit2 <- brm(
  prior = Ditta_priors, # Specify informed priors
  formula = Ditta_model, # Formula we defined above
  data = Ditta_data, # Data frame we're using
  family = gaussian(), # Gaussian likelihood for the exam scores
  seed = 1908, # For reproducibility
  file = "Data/Ditta_model2" # Save the fitted model as a .rds file
)
Model           Parameter  Median  Lower 95% CI  Higher 95% CI  ROPE %
Default priors  Intercept   69.46         65.97          72.91       0
User priors     Intercept   69.74         66.68          72.84       0
Default priors  Condition    2.68         -2.16           7.32       1
User priors     Condition    1.66         -2.14           5.36       1